Overview
The EDL Pipeline uses aggressive multi-threading to fetch data for ~2,775 stocks efficiently. Each script is optimized with specific thread counts and timeout values based on endpoint characteristics and observed rate limits.

Thread Configuration by Endpoint
Each script uses a thread count tuned for its specific endpoint:

| Script | Threads | Timeout | Rationale |
|---|---|---|---|
| fetch_company_filings.py | 20 | 10s | Dual-endpoint fetches (2 requests per stock) |
| fetch_new_announcements.py | 40 | 10s | Small payloads, fast responses |
| fetch_advanced_indicators.py | 50 | 10s | Lightweight data, high throughput |
| fetch_market_news.py | 15 | 10s | Rate-sensitive endpoint (429 errors observed) |
| fetch_all_ohlcv.py | 15 | 15s | Large historical data, chunked requests |
| fetch_fundamental_data.py | N/A | 30s | Sequential batching (100 ISINs per batch) |
Detailed Thread Analysis
High Concurrency (40-50 threads)
Scripts:
- fetch_new_announcements.py - 40 threads
- fetch_advanced_indicators.py - 50 threads

Why high concurrency works:
- Small request/response payloads
- Stateless endpoints
- Fast API response times (under 500ms)
- No observed rate limiting
Performance:
- Completes 2,775 stocks in ~3-5 minutes
- Success rate: >99%
- Minimal retries needed
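The high-concurrency pattern described above can be sketched with Python's `concurrent.futures.ThreadPoolExecutor`. The fetch function and its return shape are hypothetical stand-ins for the real per-stock API call:

```python
import concurrent.futures

def fetch_indicators(symbol: str, timeout: int = 10):
    """Hypothetical per-stock fetch; a real script would hit the indicator
    endpoint here, e.g. requests.get(f"{BASE_URL}/{symbol}", timeout=timeout)."""
    return symbol, True

def fetch_all(symbols, max_workers=50):
    """Fan out one lightweight request per stock across a thread pool."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_indicators, s): s for s in symbols}
        for fut in concurrent.futures.as_completed(futures):
            symbol, ok = fut.result()
            results[symbol] = ok
    return results
```

Because the work is I/O-bound, 50 threads keep requests in flight while most workers are blocked waiting on the network.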
Medium Concurrency (15-20 threads)
Scripts:
- fetch_company_filings.py - 20 threads
- fetch_market_news.py - 15 threads
- fetch_all_ohlcv.py - 15 threads

Why medium concurrency:
- Larger response payloads
- Multiple requests per stock (filings: 2 endpoints)
- Rate-sensitive APIs (news endpoint)
- Historical data chunking (OHLCV)
Performance:
- Filings: ~8-12 minutes for 2,775 stocks
- News: ~10-15 minutes for 2,775 stocks
- OHLCV: ~30-40 minutes for 2,775 stocks (chunked historical data)
Sequential with Batching
Script: fetch_fundamental_data.py

Why sequential batching:
- Endpoint supports batch requests (100 ISINs per call)
- More efficient than parallel single-ISIN requests
- Prevents overwhelming the endpoint
Performance:
- Completes 2,775 stocks in ~28 batches of 100 ISINs each
- Total time: ~2-3 minutes
- Success rate: >99.5%
Batching is 10x faster than parallel single-ISIN requests for this endpoint.
Timeout Configuration
Timeout Values
| Timeout | Use Case | Scripts |
|---|---|---|
| 10s | Standard API calls | Most fetch scripts |
| 15s | Large responses (OHLCV, NSE CSVs) | fetch_all_ohlcv.py, fetch_complete_price_bands.py |
| 30s | Batch fundamental data | fetch_fundamental_data.py |
Why Timeouts Matter
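A request with no timeout can hang a worker thread indefinitely on a dead or stalled connection, silently draining the pool. A minimal stdlib sketch, where the script-to-timeout mapping mirrors the table above and `safe_get` is a hypothetical helper rather than the pipeline's actual code:

```python
import socket
import urllib.error
import urllib.request

# Timeout per script, mirroring the table above; 10s is the default.
TIMEOUTS = {
    "fetch_all_ohlcv.py": 15,
    "fetch_complete_price_bands.py": 15,
    "fetch_fundamental_data.py": 30,
}
DEFAULT_TIMEOUT = 10

def timeout_for(script: str) -> int:
    return TIMEOUTS.get(script, DEFAULT_TIMEOUT)

def safe_get(url: str, script: str):
    """Fetch `url`, returning the body bytes or None on timeout/network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_for(script)) as resp:
            return resp.read()
    except (TimeoutError, socket.timeout, urllib.error.URLError):
        return None
```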
Rate Limit Handling
HTTP 429 Detection
Only the Market News API shows rate-limiting behavior.

Causes:
- Too many requests from the same IP in a short window
- Burst traffic during parallel execution

Mitigation:
- Limited to 15 threads (not 40-50 like other endpoints)
- 2-second backoff on a 429 response
- Automatic retry on the next run
Exponential Backoff (Best Practice)
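The fixed 2-second backoff above can be generalized to exponential backoff with jitter. A minimal sketch, where `fetch` is a hypothetical zero-argument callable returning an HTTP status and payload:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Retry `fetch` while it returns HTTP 429, doubling the wait each attempt.

    `sleep` is injectable so the waiting can be faked in tests.
    """
    delay = base_delay
    for attempt in range(max_retries):
        status, payload = fetch()
        if status != 429:
            return status, payload
        if attempt < max_retries - 1:
            # 2s, 4s, 8s, ... plus up to 1s of jitter to de-synchronize threads
            sleep(delay + random.uniform(0, 1))
            delay *= 2
    return 429, None
```

The jitter term matters in a threaded fetcher: without it, all 15 workers that hit a 429 at once would retry at the same instant and trigger it again.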
For production implementations, consider exponential backoff rather than a fixed 2-second delay.

Batch Delays
Inter-batch Sleep
Fundamental Data: the script sleeps briefly between batches.

Why:
- Be polite to the server
- Prevent triggering rate limits
- Smooth out request spikes

Impact:
- Adds ~10-15 seconds total to runtime
- Prevents API throttling
- Improves overall reliability
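The batching-plus-delay pattern can be sketched as follows. `fetch_batch` is a hypothetical callable that sends up to 100 ISINs in one request, and the 500ms delay matches the inter-batch sleep described in Best Practices below:

```python
import time

BATCH_SIZE = 100          # ISINs per request, per the endpoint's batch limit
INTER_BATCH_DELAY = 0.5   # seconds; the 500ms politeness sleep

def batched(items, size):
    """Yield consecutive slices of `items` of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_fundamentals(isins, fetch_batch, sleep=time.sleep):
    """Fetch fundamentals batch-by-batch, sleeping between requests."""
    results = []
    for batch in batched(isins, BATCH_SIZE):
        results.extend(fetch_batch(batch))
        sleep(INTER_BATCH_DELAY)  # smooth out request spikes
    return results
```

With ~2,775 ISINs this yields ~28 requests, so the sleeps add roughly 14 seconds in total, consistent with the ~10-15 second overhead noted above.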
No Delays for Threaded Scripts
Threaded scripts (announcements, indicators, etc.) don't use inter-request delays because:
- ThreadPoolExecutor naturally staggers requests
- Individual timeouts provide natural pacing
- No rate limiting observed at current thread counts
Progress Monitoring
Real-time Progress Updates
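A sketch of the kind of periodic progress line these scripts print; the exact format is illustrative:

```python
def report_progress(done: int, total: int, every: int = 50) -> None:
    """Print a status line every `every` completions and at the final stock."""
    if done % every == 0 or done == total:
        print(f"  Progress: {done}/{total} ({100.0 * done / total:.1f}%)")

# Inside an as_completed() loop this would be called as:
#   done += 1
#   report_progress(done, len(symbols))
```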
All multi-threaded scripts print periodic progress updates (e.g., a status line every 50-100 stocks).

Retry Strategies
Implicit Retry (Re-run Script)
Most scripts use a simple retry strategy: re-run the script, which skips files that already exist and fetches only what previously failed.

No Automatic Retry
Scripts do NOT automatically retry failed requests within the same run because:
- Keeps code simple and predictable
- Failures are often due to missing data (not transient errors)
- Re-running the script is sufficient for transient failures
Best Practices
- Use documented thread counts - these are optimized through production testing.
- Always set timeouts - prevents hanging on slow or dead connections.
- Monitor progress - print updates every 50-100 stocks to track execution.
- Handle 429 gracefully - implement backoff for rate-sensitive endpoints.
- Batch when possible - 100 ISINs per batch is ~10x faster than individual requests.
- Add inter-batch delays - a 500ms sleep prevents triggering rate limits.
- Skip existing files - avoid re-fetching unless FORCE_UPDATE is enabled.
- Log all failures - track error counts to identify systemic issues.
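The skip-existing-files practice can be sketched like this; reading FORCE_UPDATE as an environment flag is an assumption made for illustration:

```python
import json
import os

# The pipeline's FORCE_UPDATE switch, read here as an env flag (assumed).
FORCE_UPDATE = os.environ.get("FORCE_UPDATE") == "1"

def save_if_missing(path, fetch, force=FORCE_UPDATE):
    """Write fetch()'s result as JSON unless `path` already exists.

    Returns True if a fetch happened, False if the existing file was kept.
    """
    if os.path.exists(path) and not force:
        return False
    data = fetch()  # hypothetical callable doing the actual API request
    with open(path, "w") as fh:
        json.dump(data, fh)
    return True
```

This is what makes re-running a script a cheap retry: only the symbols whose files are missing trigger network requests.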
Performance Benchmarks
Pipeline Completion Times
| Script | Thread Count | Stocks | Avg Time | Req/Second |
|---|---|---|---|---|
| Advanced Indicators | 50 | 2,775 | 3-5 min | ~9-15 |
| Announcements | 40 | 2,775 | 4-6 min | ~8-12 |
| Company Filings | 20 | 2,775 | 8-12 min | ~4-6 |
| Market News | 15 | 2,775 | 10-15 min | ~3-5 |
| OHLCV (Historical) | 15 | 2,775 | 30-40 min | ~1-2 |
| Fundamental Data | Batch | 2,775 | 2-3 min | ~15-20 |
Troubleshooting
Too Many Timeouts
Symptoms:
- High error count
- "Timeout" messages in logs

Solutions:
- Reduce thread count by 25-50%
- Increase timeout value (10s → 15s)
- Check network connectivity
HTTP 429 Rate Limits
Symptoms:
- "rate_limit" in error messages
- Sudden spike in failures

Solutions:
- Reduce thread count (e.g., 40 → 20)
- Add delays between requests
- Implement exponential backoff
- Wait 5-10 minutes before retrying
Incomplete Data
Symptoms:
- Success count < total stocks
- Missing JSON files for some symbols

Solutions:
- Check error_count in the final report
- Re-run the script (it skips existing files)
- Enable FORCE_UPDATE to refresh all files
- Check whether the endpoint is down for specific stocks
Memory Issues
Symptoms:
- Script crashes on large datasets
- "MemoryError" exceptions

Solutions:
- Reduce thread count to limit concurrent memory usage
- Process in smaller batches
- Free memory between batches
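Batch-wise processing that bounds memory can be sketched with a lazy iterator, so only one batch is held at a time; `handle_batch` is a hypothetical per-batch processor:

```python
from itertools import islice

def process_in_batches(symbols, handle_batch, batch_size=200):
    """Consume `symbols` lazily, holding at most one batch in memory."""
    it = iter(symbols)
    processed = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return processed
        handle_batch(batch)
        processed += len(batch)
        # `batch` is rebound next iteration, so its data can be collected
```

Because the input can be any iterable (a generator reading symbols from disk, for example), peak memory stays proportional to `batch_size` rather than the full ~2,775-stock universe.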